Introduction

In this study we predict which water pumps throughout Tanzania are functional and which do not work at all, based on a number of variables. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. To achieve this we have at our disposal a dataset from drivendata1 containing 59,400 unique records and 41 different features. The original source is the “Taarifa waterpoints dashboard”2 3 4, which aggregates data from the Tanzania Ministry of Water.

The available data can be broken down into three classes: what kind of pump is operating, when it was installed, and how it is managed. The dataset also has a ternary target feature: which pumps are functional, which need some repairs, and which don’t work at all. We found that many of the features are either useless for predicting whether the pump is functional, or are filled with missing values. There are also groups of three variables that are nested within each other, repeating most of the same information in a hierarchical manner.

Before we found this dataset we considered a dataset from Kaggle on whether a subscription user of a certain music platform would churn or not5. We found three problems with it and decided it was not optimal for this project. First, all variables were coded in ways we could not contextualize: to ensure the privacy of their consumers, all data was scrambled, and without more concrete data it was impossible for us to produce a better analysis. The second and third problems relate to the size and structure of the dataset. The size was unmanageable: with over 32 GB of uncompressed data our laptops struggled to import it, and even when we achieved a successful import it was not into the familiar data.frame. On top of this, the structure of the data was burdensome as well. There were three different datasets, each with different data and structure. The first had the information of every client, which was reasonable enough. The second had, for each month, the method by which the client paid for the subscription. The third had, for each day, the number of songs they listened to, the number of songs they listened to from start to finish, the number of songs they only listened to 50% of, and so on. We had in mind joining all three tables and using our usual techniques and algorithms, but how would we join them? The fundamental part of this problem was predicting whether a user would churn depending on the number of songs they listened to, to see if a pattern emerged for the user. If we had joined the tables in the way we are used to, we would have discarded most of the information.

The following study is broken down into three parts. The first is a descriptive processing of the data, in which we report the most relevant features related to the target feature and correct any mistakes, ending up with a clean dataset used in the following parts. The second part consists of exploration techniques, PCA and clustering. However, since we only have a few numerical variables, we will not do a regular PCA but an MCA. Finally, the third part consists of the predictive algorithms we deemed optimal for our dataset, reporting the predictive quality of the ones that returned the best results.

Initially our target variable status_group, which recorded the status of the well in three categories, contained the level functional needs repair, which we removed because the interest of the current study is to determine whether a well is or isn’t functional: a binary target. Besides the study’s aim, the deleted level was scarce, so its removal leaves the response variable reasonably balanced. After removing these values we ended up with 57,588 observations. Second, we also decided to remove the observations with missing values at latitude (and consequently at longitude); more information on this later, but we did this now to better select the sample.

After first plowing through with all the observations, we ran into problems: the computational cost of some algorithms, such as random forest, was too high, so we decided to reduce the database’s size. We took a random sample of only 20,000 rows, a 66% reduction of the original size. Before sampling we checked whether there was an intrinsic structure to the data, such as a temporal ordering; none was found, so we proceeded to create the sample. It was important to check that we did not break the original structure of the data, as we wanted the sample to correctly represent the whole.
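The filtering and sampling described above can be sketched as follows. This is a minimal Python illustration (the original analysis was done in R); the field name status_group comes from the dataset, the toy rows are made up.

```python
import random

def prepare_sample(rows, n=20000, seed=42):
    """Drop the scarce 'functional needs repair' level and draw a simple
    random sample; no temporal or other ordering was found in the data,
    so plain random sampling preserves the structure of the whole."""
    binary = [r for r in rows if r["status_group"] != "functional needs repair"]
    random.seed(seed)
    return random.sample(binary, min(n, len(binary)))

# toy illustration with made-up rows
rows = [{"status_group": s} for s in
        ["functional"] * 6 + ["non functional"] * 3 + ["functional needs repair"]]
sample = prepare_sample(rows, n=5)
```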

Preprocessing

In this following section we explain how we preprocessed the data. Only the most relevant plots and statistics are shown here; any other plot not presented in this section is appended in the annex.

Our preprocessing basically consists of:

Features being removed

After a brief examination and solution proposal, we decided to remove certain variables from the dataset. Overall, variables are removed because of:

  • Containing too many levels: Id, wpt_name, subvillage.

  • Being contained in other variables (_type, _class, _group): these variables have a hierarchical structure, each level containing the same information as the previous one, sometimes adding a bit of complexity that is not needed: extraction_type_group, extraction_type_class, payment_type, quality_group, quantity_group, source_type, waterpoint_type_group.

  • Not containing relevant information: these either duplicate information collected in other similar variables or have a single constant value: region_code, num_private, recorded_by.

A list of the removed variables and the amount of unique levels each had.
variable  unique levels
id 20000
wpt_name 14167
num_private 48
subvillage 10068
region_code 26
recorded_by 1
extraction_type_group 13
extraction_type_class 7
payment_type 7
quality_group 6
quantity_group 5
source_type 7
waterpoint_type_group 6
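A screening like the one behind this table can be sketched as follows; a Python illustration with toy columns (the study itself works in R), flagging features that are constant or have nearly one level per row.

```python
def unique_level_counts(columns):
    """Count distinct levels per feature; candidates with a huge number
    of levels, or a single constant level, were dropped."""
    return {name: len(set(values)) for name, values in columns.items()}

# toy columns mimicking the patterns found in the real data
columns = {
    "recorded_by": ["GeoData Consultants Ltd"] * 4,          # constant -> useless
    "wpt_name": ["none", "Zahanati", "Kwa Mzee", "Shuleni"],  # ~one level per row
    "quantity_group": ["enough", "dry", "enough", "dry"],
}
counts = unique_level_counts(columns)
```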

Features whose missing-value level had to be recoded

From the remaining 28 variables, some still have to be processed or reclassified, since there are different encodings of missing values in our dataset:

  • NA coded as ‘0’: longitude, construction_year

  • NA coded as -2e-08: latitude

  • NA coded as unknown: management, management_group, payment, water_quality, quantity, source, source_class

  • NA coded as an empty string: public_meeting, scheme_management (which also has a None level that we decided to treat as missing too), scheme_name, permit
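The recoding can be sketched with a per-feature sentinel table; a Python illustration (the mapping shown is an illustrative subset of the list above, not the full table).

```python
# Sentinel codes observed in the raw data, per feature (illustrative subset).
SENTINELS = {
    "longitude": {0},
    "latitude": {-2e-08},
    "construction_year": {0},
    "management": {"unknown"},
    "public_meeting": {""},
}

def recode_missing(feature, value):
    """Map each feature's sentinel code to None (NA); pass others through."""
    return None if value in SENTINELS.get(feature, set()) else value
```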

How do we know those values of latitude and longitude are missing? They are trickier to see, as we need both values together to know whether they make sense. To view them correctly we can use a map: since we know all the values are well pumps in Tanzania, we can safely say that the few values at \((0,0)\) are missing values. The same can be said about \(-2e-08\), which must have been a zero that got mangled in floating-point translation.

Latitude and Longitude


Again, since we need both of these values to infer any relevant information, we will categorize them into quadrants. Since the country in question is Tanzania, we can easily find that its latitude lies between \((0, -12)\) and its longitude between \((29,41)\)6.
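The quadrant assignment can be sketched as a simple midpoint split; a Python illustration in which the exact cut points (-6, 35), the midpoints of the bounds above, are an assumption.

```python
# Tanzania roughly spans latitudes (0, -12) and longitudes (29, 41); split
# at the midpoints (-6, 35). The study's exact cut points are assumed here.
def quadrant(lat, lon, lat_mid=-6.0, lon_mid=35.0):
    """Label a waterpoint with its NE/NW/SE/SW quadrant."""
    ns = "N" if lat >= lat_mid else "S"
    ew = "E" if lon >= lon_mid else "W"
    return ns + ew
```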

Quadrants of the well pumps

Variables with incorrect class

There are only two variables in the dataset which are incorrectly declared. The first is date_recorded, whose format we have to change from factor to Date. The second is district_code, which is declared as an integer but should be a factor, as it is not a numerical variable even though it is coded as one.

Numerical variables being categorized

Some variables still have to be processed or reclassified with a more complex procedure. These variables are:

variable
amount_tsh
gps_height
installer
population
scheme_name

amount_tsh: Total static head (amount of water available to the waterpoint). It seems the variable might have been wrongly collected: we have a large number of zeros and then some huge values. It might be because different unit types were used, but our knowledge of water pump physics is lacking, so we don’t really know how to interpret it. We thought of a few options:

  • Transform to 4-5 categories
  • Set observations larger than 5000 to NA
  • Delete the feature

We chose the first option: we create a new variable with 4 different categories. Summary of the new variable:

New variable categorized: amount_tsh_2
interval  count
(-1,1] 13929
(1,100] 2577
(100,5e+03] 3312
(5e+03,1e+07] 182

amount_tsh categorized into 4 different categories
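The binning above can be sketched with right-closed breaks taken from the table; a Python illustration (the original was presumably done with R's cut).

```python
from bisect import bisect_left

# Right-closed breaks matching the table: (-1,1], (1,100], (100,5e3], (5e3,1e7]
BREAKS = [1, 100, 5_000, 10_000_000]
LABELS = ["(-1,1]", "(1,100]", "(100,5e+03]", "(5e+03,1e+07]"]

def bin_amount_tsh(value):
    """Assign a value to its right-closed interval label."""
    return LABELS[bisect_left(BREAKS, value)]
```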

After seeing how the CART separated on this variable when classifying status_group, we decided to also show the bivariate plot. We can see that pumps with a higher static head tend to work better than those without.

Distribution of amount_tsh by target variable

gps_height: Altitude of the well. Looking at altitude maps of Tanzania7, there aren’t many places with either zero elevation or sub-sea levels. What we think probably happened is that some people recorded the elevation while others recorded the depth of the well.

Distribution of gps_height; we can clearly see two distributions, but none related to the target

Options:

  • Separate into 2 new variables: elevation for the positive values and depth for the negative values.
  • Using latitude and longitude, impute the real height given by the geonames package
  • Delete

We decided to go with option 2: impute the real height from the given latitude and longitude. While the first option was quite appealing, we just didn’t have enough data; only 567 observations, roughly 2.8% of the data, had negative values, which is not enough. It might have been interesting if the wells with negative values had all belonged to one of the target values, but this was not the case. We imputed the data with the geonames package, which is basically a call to the geonames API8. For most of the values we used the srtm3 (Shuttle Radar Topography Mission) elevation data, since we found it was the method that gave the most consistent results. However, for the values where this method returned -32768, which are ocean areas masked as “no data”9, we used the GTOPO30 method, which gave different results for these values. For the values where the GTOPO30 method also returned an ocean area, we manually imputed a value of zero.
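The fallback chain for the imputed elevations can be sketched as a pure function; a Python illustration in which the GTOPO30 no-data code is an assumption (passed as a parameter), while -32768 is the SRTM3 ocean mask described above.

```python
SRTM3_NODATA = -32768  # ocean areas masked as "no data" in SRTM3

def resolve_elevation(srtm3_value, gtopo30_value, gtopo30_nodata=-9999):
    """Prefer SRTM3; fall back to GTOPO30 where SRTM3 is masked; finally
    use 0 where both report ocean. (GTOPO30 no-data code assumed here.)"""
    if srtm3_value != SRTM3_NODATA:
        return srtm3_value
    if gtopo30_value != gtopo30_nodata:
        return gtopo30_value
    return 0.0
```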

Distribution of positive values, ‘Elevation’

Distribution of ‘Elevation’

population: Population around the well. With 6,834 records that have a population of zero we consider categorizing this variable as well; we also find that 2,474 records have a population of exactly 1, which doesn’t make much sense looking at the graph. We might expect that, having a working water well nearby, people would move closer to it to benefit from it.

Distribution of population according to the target variable

Quantiles of population
quantile  value
0% 0
25% 0
50% 30
75% 225
100% 10000

Options:

  • Categorize in levels
  • Keep as it is
  • Delete

We decided to categorize it because this is a feature we think could be interesting for later analysis.

Distribution of population
interval  count
(-0.5,0] 6834
(0,1] 2474
(1,100] 3110
(100,215] 2537
(215,2e+03] 4867
(2e+03,3.05e+04] 178

Manual Fuzzy match

installer: Organization that installed the well. With 2,146 different levels, we find very similar names among them.

Even though there are over 2000 levels, the installer could be a relevant variable. Therefore it will be kept but might be ignored in further stages of this study. Before using it, we will try to preprocess it in order to decrease the number of levels.

In the previous code, lowercasing has been applied, and as a result the number of distinct levels decreased from 2,146. Also, the number of levels that had a frequency of 1 went from 635 to 571.

Finally, a conceptual level grouping is performed, but only for levels with a frequency of over a hundred. These are mostly typos or different spellings of the same thing. For example unisef \(\rightarrow\) unicef, oxfarm \(\rightarrow\) oxfam, among many others.

The final grouping consists of considering as “other” all levels appearing fewer than 10 times throughout the database.

Finally, the feature has 188 levels. While we wanted to do a fuzzy match for this feature, it was simply not feasible: there are too many levels with similar names that are nonetheless distinct. The fuzzy match ended up joining values that were not supposed to be joined before joining the ones that were.
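The cleaning pipeline for installer (lowercase, fix frequent typos, lump rare levels) can be sketched as follows; a Python illustration where the typo map and toy values are an illustrative subset.

```python
from collections import Counter

# Illustrative typo map; the real grouping covered every level with
# frequency over one hundred.
TYPO_MAP = {"unisef": "unicef", "oxfarm": "oxfam"}

def clean_installer(values, min_freq=10):
    """Lowercase, fix known typos, then lump rare levels into 'other'."""
    fixed = [TYPO_MAP.get(v.lower(), v.lower()) for v in values]
    freq = Counter(fixed)
    return [v if freq[v] >= min_freq else "other" for v in fixed]

# toy data: 11 unicef spellings survive; rare installers get lumped
values = ["UNICEF"] * 7 + ["unisef"] * 4 + ["Oxfam"] * 2 + ["DWE"] * 3
cleaned = clean_installer(values)
```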

scheme_name: Who operates the waterpoint. Similarly to installer, we group all the levels appearing fewer than 10 times into an “other” category.

We reduced from 2697 to only 235 levels.

Finally we join a few levels in other features:

water_quality: We have the levels fluoride abandoned and salty abandoned, which we joined into their corresponding levels fluoride and salty, as abandoned is an adjective describing something other than what the feature is trying to depict.

waterpoint_type: We joined the level dam into other, as we only had 3 observations of dam.

The remaining character variables are all converted to factor where appropriate, and we apply droplevels to eliminate any residual empty levels.

Why haven’t we done the fuzzy match? As stated earlier, we tried, but it joined similar names that were not supposed to be joined before the ones we thought should be. To do it correctly we would need the help of an expert in the local language, able to discern what is the same entity and what is a typo. We also considered joining all the levels with few observations into a lumped other (as we did with water_quality); however, doing so distorted the original data distribution: the imputations of missing values were far worse than without lumping, and the same happened with the models.

Imputing missing data

In order not to bias the analysis we remove the rows with missing values in either latitude or longitude. We can afford to do this because the amount of missing data is very small relative to the size of the dataset. For the rest of the missing values we will use MICE; however, we will not impute the categorical variables, because we feel that a missing value is actually more useful to us: a missing value in a category may be there for a reason, and not knowing might help the models find better splits. After this first draft we might change this and try to impute some missing values of the more relevant categories. For now, though, we only impute the numerical variable construction_year, as it is the only one with missing values.

First let’s check the amount of missing values in our dataset. We shall remove some variables to help the MICE function: since we have variables with many levels, we were running into combinatorial problems and the computational time was too high to do anything. To find out which variables to use for MICE, we first use the aregImpute function from the Hmisc package; experimenting with it, we find the optimal set of variables to use, and it gives us a first estimate of the missing values.

Using the same variables as for aregImpute, we finally run MICE, although it was probably unnecessary, because the differences between the two are negligible. Looking at the following plots we can see that MICE did a pretty good job.

Density of the imputed values vs. the real values


Density of the imputed values vs. the real values by the target feature


Descriptive

In this section we will look at some of the more relevant features that relate to the target variable.

Target status_group

Our target variable has three different modalities. Although we could group functional needs repair with non functional, as both need repair (and that is the objective of this project, finding out which ones need repair), for now let’s keep it this way. We see that the response is not balanced (58.5% vs. 41.5%).

Latitude and Longitude

Unfortunately we do not see any clear relationship between latitude/longitude and whether the pumps work or not, nor for any of the other features; all the bivariate plots can be found in the annex.

Latitude and Longitude with target feature, geolocation

Construction year

We found that the variable construction year is fairly correlated with the target feature, which makes sense: the more recently a well was built, the sturdier it should be, not yet having stood the test of time. Still, looking at the distributions, they seem pretty similar.

Distribution of construction year by target variable

Correlation between features

To start the exploratory analysis, we first present the results of a correlation plot between features. Observing the numerical correlation plot first, we find nothing interesting. We can see how elevation is correlated with latitude and longitude, and also how latitude and longitude present some correlation between themselves. This doesn’t really give us much information.

For the categorical features, we observe that lga and ward are highly positively correlated, but both are also very correlated with many other variables. As we explained previously, management and management_group, as well as source and source_class, are very highly positively correlated.

As commented, status_group, our target feature, doesn’t seem to have a high correlation with any of the features except ward, but every feature is correlated with ward. No feature stands out for predicting status_group.

We keep these features for now, but we won’t use them in most techniques unless stated otherwise.

Numerical correlation plots


Categorical correlation plots


Visualization & Exploration

PCA

We have done a PCA with all the numerical variables from our clean database:

Variables to construct the PCA
variable
gps_height
longitude
latitude
construction_year

If we look at the screeplot of the principal components, we see that we should take the first 3, which together explain more than 80%. Still, as we only have 4 numerical variables versus 25 categorical ones, it seems obvious that PCA won’t summarize the cloud as well as other methods; we will look more closely at the MCA instead. Still, let’s look at the results of the PCA:

PCA scree plot


We have selected 3 principal components in order to preserve at least 80% of the variability.

Visualization of the results:

First factorial plane


As we suspected, the PCA of the 4 numerical features isn’t very informative. Looking at the first factorial plane, we can say that longitude and latitude return little information and aren’t relevant to explaining the target feature. We do see, however, that the more recent the construction year of the well (indicated by larger values), the higher the probability that the well is functional; the same can be inferred for the elevation of the well. It seems that the more elevated the wells are and the more recently they were constructed, the more likely they are to be functional. The inverse also applies: among older wells at lower elevations we find more non functional wells. The third factorial plane (constructed from the second and third components) gives further evidence that the construction year might be quite relevant to whether a well needs repair or not, which in hindsight seems pretty obvious.

Third factorial plane


The second factorial plane is quite confusing and isn’t really relevant to the discussion of the target feature; in it, the distribution of the target appears almost random.

MCA

As stated before, since we have many more categorical variables than numerical ones, it is more interesting for us to analyze the MCA than the PCA, since it will capture more information. We made the following selection:

  • Active categorical features: basin, region, public_meeting, permit, extraction_type, payment, water_quality, quantity, source, source_class, waterpoint_type, quadrants, amount_tsh_2

  • Supplementary categorical features: date_recorded, funder, installer, district_code, lga, ward, scheme_management, scheme_name, management, management_group, status_group, population_2

  • Supplementary numerical features: gps_height, longitude, latitude, construction_year

Note that among the active variables we use the features we recoded from numeric to categorical, except population_2.

Dimension of the individuals space: 103

Number of active character variables: 13

Inertia in case of independency: 0.0769231

The screeplot is difficult to read, as it’s hard to decide exactly where to cut. We decided not to use all variables because we started to get singularity problems, so we removed the variables with many different levels: installer, scheme_management, etc. The components explain little of the data: the first one explains only 5.00% and the second only 4.39%.

MCA scree plot


To select the number of significant dimensions, we start by eliminating the trivial inertia (0.0769231). We achieve this by selecting the eigenvalues larger than \(1/p\).

We have selected 34 significant components.
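The selection rule above can be sketched in a few lines; a Python illustration in which the eigenvalues are made up, while p = 13 and the 1/p threshold come from the MCA description above.

```python
def significant_dimensions(eigenvalues, p):
    """Keep MCA dimensions whose eigenvalue exceeds the trivial
    inertia 1/p, where p is the number of active categorical features."""
    threshold = 1.0 / p
    return [ev for ev in eigenvalues if ev > threshold]

# toy eigenvalues; with p = 13 the threshold is 1/13 ~ 0.0769
eigs = [0.12, 0.09, 0.08, 0.05]
kept = significant_dimensions(eigs, 13)
```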

MCA scree plot with significant dimensions


Plot of modalities

First factor plane MCA

Unfortunately, we aren’t any closer to finding relevant features that could help us differentiate between the target categories. Since both are so close together, it’s hard to say which features might be more characteristic of one value than of the other. Checking the first 5 planes we find similar results.

Clustering

Since we have such a big dataset, we cluster by first running 2 K-means with a relatively big \(K\), in this case 14. Afterwards we create the contingency table of the two K-means clusterings and perform a hierarchical clustering of the resulting centroids. To avoid noise, we do all this over the principal components we found significant.
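The crossing step can be illustrated as follows: each non-empty cell of the contingency table of the two partitions becomes one candidate cluster whose centroid feeds the hierarchical step. A toy Python sketch (the real computation was done over the significant principal components, not toy labels):

```python
from collections import Counter

def cross_partitions(labels_a, labels_b):
    """Cross two K-means labelings; the Counter holds the non-empty
    cells of their contingency table, one candidate cluster per cell."""
    return Counter(zip(labels_a, labels_b))

# toy example: two partitions of six points
a = [0, 0, 1, 1, 2, 2]
b = [0, 1, 1, 1, 2, 0]
cells = cross_partitions(a, b)
```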

K-means contingency
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 0 0 0 10 0 0 0 0 0 1 0 0 1224
0 0 0 0 4 0 381 0 0 2 0 17 1435 0
2355 0 0 0 25 0 94 0 2 0 0 1171 0 0
0 0 0 0 4 0 1 0 1 32 2130 0 8 0
0 0 0 0 4 0 0 815 0 0 0 2 0 0
5 0 0 0 24 0 1 0 913 54 1 1 2 0
0 710 0 0 28 0 0 0 0 0 0 2 0 0
0 0 92 18 41 0 34 14 1 2 0 1225 5 0
0 0 0 0 6 0 6 0 64 1383 9 0 0 0
0 0 440 0 90 6 0 0 0 0 0 10 149 0
0 0 925 53 2 432 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 315 0 0 0 0 0
0 0 0 1078 10 0 8 0 0 0 0 0 0 0
0 0 0 0 2 522 4 0 0 12 1579 1 0 3

We now have 67 clusters. Let’s do the hierarchical clustering.

Dendrogram of the previous contingency table. We try 3 cuts for now.


We run K-means with 3 classes, taking the previous centroids as the clustering initialization.

The sizes of the clusters are:

Size of each cluster 1 to 3
size
6896
8660
4444

Finally, let’s see the partition visually on the data:

K-means clustering after 2 successive K-means and a hierarchical clustering


Profiling

In this following part we explain how the clusters are organized by profiling them with the most informative data. We only present a sample of the most relevant barplots here to keep things brief.

Basin

The first, and quite relevant, way to separate the profiles is through the basin. We can see how class 1 has a larger proportion of internal and Wami / Ruvu water basins than the other two cuts. Secondly, class 2 is the only one with Lake Nyasa; it also has a larger proportion of the river Pangani than class 1, and the same can be said of the river Rufiji. Both of these previous cuts are at the southern coast / Ruvuma. Finally, we can separate class 3 because it is the only one with Lake Victoria as a basin. Also, cut 3 is not close to any of the rivers mentioned previously; only a tiny proportion is close to Rufiji.

We can clearly see that the presence of Lakes Nyasa and Victoria can discern between cuts 2 and 3, as can the proximity of rivers.

Profiling cluster: basin

Extraction type

Starting with cut 2, we know that if the extraction type is not ksb, india mark ii, or gravity it will not be from this class. Speaking of gravity, if we find a well extracted by gravity it will most probably be from class 2, and if not, from class 3. If we find an extraction method not used by cut 2 we won’t be sure where to put it, although if it’s an india mark ii it will more probably be from cut 1; the same can be said about ksb.

Profiling cluster: extraction_type

Payment

Starting with the third class, these are the places that mostly never pay, maybe because they are also the places without much population around them. Otherwise, we can see that classes 1 and 2 are separated by whether they pay on a schedule or per bucket: the wells in class 2 usually pay within an agreed timeframe, while those in class 1 usually pay per bucket or never pay.

Profiling cluster: payment

Water quality

If we remove the soft level (by clicking the legend title), we get a much better representation. While most wells have soft water, if we find salty water it will most probably be from the first cut, or the third. The inverse can be said if the water has a milky taste. Finally, if the water tastes of fluoride we know for sure it is from the second class.

Profiling cluster: water quality

Source

If the water source is a lake, it will be from cut 3 (even though we have clearly seen that cut 2 also has lakes; those might have been considered springs). Other interesting things we can see are that most machine dbh sources are from class 1, while most river-sourced water is from class 2.

Profiling cluster: Source

Quadrants & latitude and longitude

With the two following plots we can easily see how the clusters separate geographically: cut 3 is mostly in the northern part of the country, while cut 2 is around the sides and follows the central belt of Tanzania. Cut 1 is a bit trickier, but it is mostly located near the sea and in the center of Tanzania. Still, looking at the quadrants we see that cuts 2 and 1 are very similar.

Profiling cluster: quadrants

Profiling cluster: geo location

Elevation

An interesting result is that cut 1 is mostly located around 680 m of elevation but can also be found at sea level. Cut 3 seems to sit on a plain, with very little variation in elevation if we disregard the outliers.

Profiling cluster: elevation

Target feature

Finally, it should be stated that both levels of the response variable are fairly balanced in each class, although in the second class we do find a higher proportion of functional wells (68.38% vs. 31.61%), and the same in the third class, but less so (56.1% vs. 43.9%).

Profiling cluster: target variable

Since we didn’t find any class to be closely related to the target feature, we end our profiling here; it’s not getting us any closer to understanding how the variable is distributed and what will help predict it.

Predictive algorithms

The final aim of this project is to predict the target feature status_group. This variable has two levels, functional and non functional. It is therefore a variable that shows the overall state of the well, and it could be really valuable to predict whether a well has a high or low chance of breaking. Since we have such a big dataset, we only report the results of the algorithms.

Validation protocol

Data slicing

Data slicing is the step of splitting the data into a train and a test set. The training dataset is used specifically for model building; the test dataset must not be touched while building the model. Even during standardization, we should not standardize using the test set.

Before modeling, a list of 70% of the rows in the original dataset (14,001 rows) is created to use for training, while 30% is kept for testing (5,999 rows).

The class distribution is assessed in order to check how well balanced the classes are; recall that the scarce functional needs repair level was already removed during preprocessing, leaving a binary target. In the following tables we can check the distribution of both datasets we have created.

Train dataset
freq percentage
functional 8165 58.31726
non functional 5836 41.68274
Validation dataset
freq percentage
functional 3499 58.32639
non functional 2500 41.67361

Cross - validation

Cross-validation is a method of estimating the expected prediction error while ensuring the model is not over fitting. There are many types of cross-validation.

Hold Out Method: training sample (70%) vs. testing sample (30%) \(\rightarrow\) if the error rate is similar on both \(\rightarrow\) not overfitted. Advantage: low computing time. Disadvantage: this method is prone to sample bias.

K-Fold Cross-Validation: the sample is split into K equally sized subsamples. All of the models used except the random forest have been evaluated through a 10-fold cross-validation; due to the huge computational cost of tuning the random forest, for it we had to reduce this to a 5-fold cross-validation. Prediction error = average error; as the response is categorical, we use accuracy, so the result is the average accuracy over all folds. Advantage: how the data is divided matters less, as selection bias is no longer present.

Algorithms are trained using 10-fold cross-validation. This slows down the search process, but it is intended to limit and reduce overfitting on the training set. It won’t remove overfitting entirely, so holding back a validation set for a final check is a great idea.
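The fold construction behind k-fold cross-validation can be sketched as an index generator; a minimal Python illustration of the idea, not the caret machinery actually used in the study.

```python
import random

def kfold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation:
    shuffle once, deal the indices into k folds, and let each fold
    serve as the test set exactly once."""
    idx = list(range(n))
    random.seed(seed)
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(20, k=5))
```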

Evaluation criteria

These are the metrics we have used to evaluate algorithms on our dataset:

Accuracy is the percentage of correctly classified instances out of all instances. It is more useful on binary classification than on multi-class problems, because in the multi-class case it can be less clear exactly how the accuracy breaks down across the classes.

Kappa, or Cohen’s Kappa10, is similar to classification accuracy, except that it is normalized by the baseline of random chance on your dataset. It is a more useful measure on problems with a class imbalance (e.g. a 70-30 split for classes 0 and 1, where you can achieve 70% accuracy by predicting all instances as class 0). It basically tells you how much better your classifier performs than one that simply guesses at random according to the frequency of each class.
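The definition above amounts to comparing observed agreement with the chance agreement implied by the marginal class frequencies; a small Python sketch (not the caret implementation used in the study):

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed accuracy - chance agreement) scaled
    by (1 - chance agreement), with chance agreement derived from the
    marginal class frequencies of both labelings."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_exp = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_obs - p_exp) / (1 - p_exp)

kappa_perfect = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "b"])
kappa_chance = cohens_kappa(["a", "a", "b", "b"], ["a", "b", "a", "b"])
```

A classifier in perfect agreement scores 1, and one performing exactly at the chance level scores 0, which is what makes kappa robust to class imbalance.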

Models considered

Overall, this study has tried to select a wide range of different models, using both parametric and non-parametric approaches; we chose every model according to a specific aim.

Firstly, knn and lda have been chosen for their computational efficiency, which in a database as large as ours can be an important factor. Both methods are known for their simplicity and will be used as an accuracy baseline.

The CART method has been used for its visual ease: it provides a low-level technical understanding of the predictions it returns.

With a similar aim, logistic regression has been used mainly to evaluate the effect of each variable, and also to add a parametric model.

Finally, in order to improve the accuracy of our predictions, two black-box models have been chosen: tree bagging and random forests. The former for its ability to manage a high number of levels, the latter for its well-known prediction accuracy.

Training results

Logistic regression

The following binary logistic model is used to estimate the probability of a binary response based on all of our predictors. If this were the selected model, it would allow us to analyse how the presence of a risk factor increases the odds of a given outcome by a specific factor. In this case, the tuning is done automatically by caret11 itself, trying several combinations of the parameters cost (C), tolerance (epsilon) and loss function (L1, or L2 primal/dual).

## Regularized Logistic Regression 
## 
## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12601, 12601, 12601, 12601, 12601, 12601, ... 
## Resampling results across tuning parameters:
## 
##   cost  loss       epsilon  Accuracy   Kappa     
##   0.5   L1         0.001    0.7582341  0.48460167
##   0.5   L1         0.010    0.7483774  0.46234128
##   0.5   L1         0.100    0.7213753  0.40777796
##   0.5   L2_dual    0.001    0.5762440  0.01874711
##   0.5   L2_dual    0.010    0.5475179  0.04012736
##   0.5   L2_dual    0.100    0.5401131  0.01963692
##   0.5   L2_primal  0.001    0.6840968  0.31904311
##   0.5   L2_primal  0.010    0.5693900  0.02171018
##   0.5   L2_primal  0.100    0.5696042  0.02063917
##   1.0   L1         0.001    0.7590915  0.48721584
##   1.0   L1         0.010    0.7493057  0.46516642
##   1.0   L1         0.100    0.7135195  0.38940072
##   1.0   L2_dual    0.001    0.4599008  0.01939257
##   1.0   L2_dual    0.010    0.5246079  0.05603794
##   1.0   L2_dual    0.100    0.5605180  0.03397983
##   1.0   L2_primal  0.001    0.6840254  0.31890742
##   1.0   L2_primal  0.010    0.5693900  0.02171018
##   1.0   L2_primal  0.100    0.5696042  0.02063917
##   2.0   L1         0.001    0.7589478  0.48733030
##   2.0   L1         0.010    0.7479495  0.46227215
##   2.0   L1         0.100    0.7185209  0.40196670
##   2.0   L2_dual    0.001    0.5018274  0.02415690
##   2.0   L2_dual    0.010    0.5578969  0.04206258
##   2.0   L2_dual    0.100    0.5239726  0.02379684
##   2.0   L2_primal  0.001    0.6840254  0.31890742
##   2.0   L2_primal  0.010    0.5693900  0.02171018
##   2.0   L2_primal  0.100    0.5696042  0.02063917
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 1, loss = L1 and epsilon
##  = 0.001.
Result of logistic regression training
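The odds interpretation mentioned above can be made concrete with a small sketch; the coefficients below are hypothetical, not those fitted by caret:

```python
import math

def predict_proba(x, beta0, betas):
    """P(outcome | x) under a logistic model."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1 / (1 + math.exp(-z))

# Key interpretive property: a one-unit increase in predictor j
# multiplies the odds p / (1 - p) by exp(beta_j), all else fixed.
beta0, betas = -1.0, [0.8, -0.3]               # hypothetical coefficients
p1 = predict_proba([1.0, 2.0], beta0, betas)
p2 = predict_proba([2.0, 2.0], beta0, betas)   # predictor 1 up by one unit
odds_ratio = (p2 / (1 - p2)) / (p1 / (1 - p1)) # equals exp(0.8)
```

This constant odds multiplier per unit change is what makes the parametric model easy to read, at the cost of assuming that log-odds are linear in the predictors.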

Linear discriminant analysis

LDA provides a less direct approach to modelling the predicted probabilities given a set of predictors X. The algorithm models the distribution of the predictors X separately in each of the response classes (given Y), and then uses Bayes’ theorem to flip these around into estimates of the class probabilities. When these distributions are assumed to be normal, the model turns out to be very similar to logistic regression.
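A minimal one-dimensional sketch of this "Bayes flip", assuming normal class-conditional densities with a shared variance (the priors and means below are made up for illustration):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, classes):
    """Model P(X | Y = k) per class, then use Bayes' theorem to flip
    it into P(Y = k | X = x), proportional to prior_k * density_k(x)."""
    scores = {k: prior * normal_pdf(x, mu, var)
              for k, (prior, mu, var) in classes.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# hypothetical priors and means, shared variance as LDA assumes
classes = {"functional":     (0.58, 2.0, 1.0),
           "non functional": (0.42, 4.0, 1.0)}
post = posterior(3.2, classes)     # posterior probabilities summing to 1
```

With a shared variance the resulting decision boundary is linear in x, which is the source of the method's name and of its closeness to logistic regression.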

## Linear Discriminant Analysis 
## 
## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12601, 12601, 12601, 12601, 12601, 12601, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7679465  0.5050678

CART

We need to specify the complexity parameter in order to prune our tree. We will choose the complexity parameter associated with the minimum possible cross-validated error.

## CART 
## 
## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12601, 12601, 12601, 12601, 12601, 12601, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03298492  0.6922414  0.3479641
##   0.05705963  0.6708078  0.3028181
##   0.17820425  0.6245308  0.1619916
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03298492.
Result of CART training

After viewing the results, we selected the CART model with a complexity parameter of 0.0329849.
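The selection rule (take the cp with the highest cross-validated accuracy, i.e. the lowest CV error) can be restated in plain Python using the values reported by caret:

```python
# cp values and cross-validated accuracies, as reported by caret above
cv_results = {0.03298492: 0.6922414,
              0.05705963: 0.6708078,
              0.17820425: 0.6245308}
best_cp = max(cv_results, key=cv_results.get)   # smallest CV error
```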

Result of CART training, but pretty

The resulting tree is fairly simple: the splits are defined by construction year and amount tsh, i.e. by how old the well is and the amount of pressure it holds. The model achieves an accuracy of around 70%, a good result considering the simplicity of the model.

KNN

The most commonly used distance measure is Euclidean distance, also known as the usual (simple) distance. Euclidean distance is recommended when the data is dense or continuous, making it a sensible proximity measure here.

The KNN classifier is also considered an instance-based, non-generalizing learning algorithm: it stores the training records in a multidimensional space and, for each new sample and a particular value of K, recomputes the Euclidean distances and predicts the target class. It therefore does not create a generalized internal model.
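A minimal sketch of this instance-based procedure in plain Python (toy 2-D coordinates, not our actual predictors):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k):
    """No generalised model: rank the stored training records by
    Euclidean distance to x and take a majority vote among the k nearest."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# toy records, three per class
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = ["functional"] * 3 + ["non functional"] * 3
label = knn_predict(train_X, train_y, (0.5, 0.5), k=3)   # -> "functional"
```

Note that every prediction rescans the whole training set, which is why kNN is cheap to "train" but can be slow to apply on a dataset of our size.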

## k-Nearest Neighbors 
## 
## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12601, 12601, 12601, 12601, 12601, 12601, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6631714  0.2953107
##   7  0.6651705  0.2978820
##   9  0.6717413  0.3088164
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
Result of KNN training

The best parameter is k = 9, because it is the one that returns the best accuracy (around 67%).

Random Forest

We will use the popular Random Forest algorithm as the subject of our algorithm tuning. When tuning an algorithm, it is important to have a good understanding of the algorithm so that you know what effect the parameters have on the model you are creating.

In this case study we will stick to tuning two parameters, mtry and ntree, which affect our random forest model as described below. There are many other parameters, but these two are perhaps the most likely to have the biggest effect on the final accuracy.12

ntree: the number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times.

mtry: the number of variables randomly sampled as candidates at each split. Note that the default values differ for classification (sqrt(p), where p is the number of variables in x) and regression (p/3), but as we tune the models these defaults will not be used.
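A small sketch of both ideas, assuming the defaults of R's randomForest package; `candidate_features` is a hypothetical helper illustrating the per-split subsampling:

```python
import math
import random

def default_mtry(p, task="classification"):
    """randomForest's defaults: floor(sqrt(p)) for classification,
    max(floor(p / 3), 1) for regression (overridden when tuning)."""
    return int(math.sqrt(p)) if task == "classification" else max(p // 3, 1)

def candidate_features(p, mtry, rng):
    """Hypothetical helper: at each split the forest only considers a
    random subset of mtry out of the p available variables."""
    return rng.sample(range(p), mtry)

p = 15                                   # predictors in our training set
subset = candidate_features(p, default_mtry(p), random.Random(0))
# default_mtry(15) -> 3; the tuned model above instead selected mtry = 15
```

With mtry = p = 15 every split sees all variables, so the tuned forest behaves much like bagged trees; the tuning grid above confirms this is what maximised accuracy here.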

## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 11201, 11201, 11201, 11200, 11201, 11201, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ntree  Accuracy   Kappa     
##    1      50   0.6127894  0.08541269
##    1     250   0.6099566  0.07634886
##    1     750   0.6094325  0.07457161
##    1    1000   0.6094087  0.07457126
##    1    1500   0.6089801  0.07333210
##    3      50   0.7508988  0.46344766
##    3     250   0.7522084  0.46571215
##    3     750   0.7534942  0.46828592
##    3    1000   0.7548036  0.47113040
##    3    1500   0.7533037  0.46744698
##    8      50   0.8062283  0.59419158
##    8     250   0.8095375  0.60073127
##    8     750   0.8100374  0.60177442
##    8    1000   0.8108230  0.60342186
##    8    1500   0.8102516  0.60225071
##   15      50   0.8154895  0.61617489
##   15     250   0.8183941  0.62178232
##   15     750   0.8184893  0.62209506
##   15    1000   0.8179893  0.62098324
##   15    1500   0.8189654  0.62312591
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 15 and ntree = 1500.
Result of Random Forest training

SVM

The principle behind an SVM (Support Vector Machine) classifier is to build a hyperplane separating the data of the different classes. The construction of this hyperplane varies and is the main task of an SVM classifier; the focus when drawing it is to maximize the distance from the hyperplane to the nearest data point of either class. These nearest data points are known as support vectors.

We will build a model using a non-linear kernel, the Radial Basis Function (RBF). With a radial kernel, proper values of the cost (C) and sigma, \(\sigma\), parameters need to be selected.

sigma: the inverse kernel width of the radial basis kernel; larger values give a narrower, more flexible kernel.

C: the cost of constraint violation (default 1); this is the “C” constant of the regularization term in the Lagrange formulation.
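For reference, the radial kernel itself is simple to state; a plain-Python sketch of the kernlab-style parameterization, k(x, y) = exp(−σ‖x − y‖²):

```python
import math

def rbf_kernel(x, y, sigma):
    """kernlab-style radial basis kernel: exp(-sigma * ||x - y||^2).
    Larger sigma means similarity decays faster with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)

# identical points have similarity 1.0; similarity decays with distance
k_near = rbf_kernel((0, 0), (1, 0), sigma=0.025)
k_far = rbf_kernel((0, 0), (5, 0), sigma=0.025)   # k_near > k_far
```

The tuning grid above searched sigma between 0.025 and 0.1; the smallest (widest, smoothest) kernel won, paired with C = 2.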

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 14001 samples
##    15 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12602, 12601, 12601, 12600, 12601, 12601, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C    Accuracy   Kappa    
##   0.025  0.5  0.7861580  0.5481857
##   0.025  1.0  0.7942284  0.5674234
##   0.025  2.0  0.7973004  0.5754178
##   0.050  0.5  0.7814443  0.5376455
##   0.050  1.0  0.7905153  0.5606497
##   0.050  2.0  0.7905857  0.5620461
##   0.060  0.5  0.7783733  0.5304241
##   0.060  1.0  0.7883010  0.5557269
##   0.060  2.0  0.7880857  0.5563207
##   0.070  0.5  0.7767304  0.5259627
##   0.070  1.0  0.7848724  0.5481515
##   0.070  2.0  0.7864432  0.5523978
##   0.080  0.5  0.7745156  0.5202600
##   0.080  1.0  0.7825864  0.5428197
##   0.080  2.0  0.7837289  0.5463706
##   0.090  0.5  0.7720865  0.5143678
##   0.090  1.0  0.7809438  0.5385864
##   0.090  2.0  0.7820863  0.5425699
##   0.100  0.5  0.7696582  0.5083437
##   0.100  1.0  0.7775868  0.5311464
##   0.100  2.0  0.7813007  0.5403666
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.025 and C = 2.
Result of SVM training

Bagging

Bagging is a way to decrease the variance of your prediction: it resamples the original training set with replacement to generate multiple bootstrap samples (multisets) of the same size as the original data, fits a model to each, and combines their predictions. This model has no tuning parameters.
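A minimal sketch of the bootstrap-and-vote idea in plain Python; the majority-class base learner is a hypothetical stand-in for the CART trees actually bagged here:

```python
import random
from collections import Counter

def bagging_predict(train, x, fit, predict, n_models=25, seed=1):
    """Draw bootstrap samples (with replacement, same size as the
    original data), fit one model per sample, and take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]    # one bootstrap multiset
        votes.append(predict(fit(boot), x))
    return Counter(votes).most_common(1)[0][0]

# hypothetical base learner: predict the bootstrap sample's majority label
fit = lambda sample: Counter(lbl for _, lbl in sample).most_common(1)[0][0]
predict = lambda model, x: model
train = ([(i, "functional") for i in range(8)]
         + [(i, "non functional") for i in range(2)])
label = bagging_predict(train, None, fit, predict)
```

Averaging over many resampled fits is what reduces variance; with real trees as base learners this is exactly the bagged CART reported below.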

## Bagged CART 
## 
## 14001 samples
##    16 predictor
##     2 classes: 'functional', 'non functional' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12601, 12601, 12601, 12601, 12601, 12601, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8072996  0.5989088

Comparison of the best results

In the following plot, we present a compilation of the results obtained by each of the algorithms on the validation set.

## 
## Call:
## summary.resamples(object = results)
## 
## Models: lg, lda, cart, knn, bg, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lg   0.7401856 0.7529012 0.7585714 0.7590915 0.7683929 0.7750000    0
## lda  0.7530335 0.7657561 0.7678571 0.7679465 0.7699589 0.7807143    0
## cart 0.6688080 0.6748784 0.6832143 0.6922414 0.7110714 0.7298070    0
## knn  0.6435714 0.6610756 0.6692857 0.6717413 0.6865509 0.6957143    0
## bg   0.7864286 0.7968947 0.8050000 0.8072996 0.8203245 0.8258387    0
## svm  0.7851535 0.7906773 0.7996429 0.7973004 0.8037853 0.8071429    0
## rf   0.8014286 0.8132470 0.8232143 0.8211567 0.8289593 0.8377412    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lg   0.4449915 0.4726305 0.4846757 0.4872158 0.5073925 0.5250845    0
## lda  0.4735051 0.5019750 0.5043211 0.5050678 0.5110533 0.5316001    0
## cart 0.2907180 0.3123216 0.3276808 0.3479641 0.3945876 0.4293850    0
## knn  0.2501966 0.2891606 0.3042604 0.3088164 0.3385883 0.3586324    0
## bg   0.5574566 0.5764598 0.5943510 0.5989088 0.6268629 0.6366230    0
## svm  0.5498404 0.5600603 0.5796607 0.5754178 0.5894458 0.5971180    0
## rf   0.5864080 0.6124425 0.6311699 0.6274054 0.6438313 0.6633386    0
Validation set results of each algorithm

Final model

As can be seen in the dotplot, the best model in both accuracy and kappa is the Random Forest, tuned with the hyperparameter mtry = 15 and with ntree held constant, as it was found that forest size had no effect on accuracy.

We present the following confusion matrix obtained with the test data:

## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       functional non functional
##   functional           3080            651
##   non functional        419           1849
##                                           
##                Accuracy : 0.8216          
##                  95% CI : (0.8117, 0.8312)
##     No Information Rate : 0.5833          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6282          
##  Mcnemar's Test P-Value : 1.643e-12       
##                                           
##             Sensitivity : 0.8803          
##             Specificity : 0.7396          
##          Pos Pred Value : 0.8255          
##          Neg Pred Value : 0.8153          
##              Prevalence : 0.5833          
##          Detection Rate : 0.5134          
##    Detection Prevalence : 0.6219          
##       Balanced Accuracy : 0.8099          
##                                           
##        'Positive' Class : functional      
## 

It can be seen that the accuracy obtained on the test dataset is essentially the same as the one obtained on the training dataset. As a result, it can be inferred that the model is representative of the real data and that there is no overfitting.

Therefore, accuracy on the training dataset is 0.8211567 and kappa 0.6274054, while on the test dataset accuracy is 0.8216 and kappa is 0.6282.

Moreover, it should be highlighted that the 95% confidence interval is quite narrow, suggesting that the model would be robust if it were used for real-life predictions.

Even though the kappa value is the highest among all the models calculated, it is still noticeably lower than the accuracy. Landis and Koch characterise kappa values between 0.61 and 0.80 as substantial agreement, although they supplied no evidence to support these thresholds.

Finally, sensitivity is 0.88: the proportion of positives (functional wells) correctly identified as such is 88%. On the other hand, specificity is 0.74: the proportion of negatives correctly identified as such is 74%. This indicates that the final model of this study is better at predicting functioning wells than non-functioning ones.
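These figures can be recomputed directly from the confusion matrix cells reported above, taking "functional" as the positive class:

```python
# cells of the test-set confusion matrix reported above
tp, fn = 3080, 419    # functional wells classified correctly / missed
fp, tn = 651, 1849    # non functional wells misclassified / caught
n = tp + fn + fp + tn

accuracy = (tp + tn) / n            # 0.8216
sensitivity = tp / (tp + fn)        # 0.8803, recall on functional wells
specificity = tn / (tn + fp)        # 0.7396, recall on non functional wells

# Cohen's kappa from the matrix margins: observed vs chance agreement
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (accuracy - p_e) / (1 - p_e)   # 0.6282
```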

We can plot the relevance of each variable inside the Random Forest:

Result of Random Forest training, importance of each variable

Scientific and personal conclusions

The final conclusions of this study can be gathered into four main conceptual blocks:

The first concerns the difficulties of working with this kind of field data, where every observation is very sensitive to how the data was collected. As a consequence of this sensitivity, parts of the dataset may well have been collected incorrectly, making response prediction harder, if not inaccurate. This also has a relevant impact on the preprocessing stage, increasing not only the working time but also the number of imputed variables, again decreasing the potential accuracy.

The second refers to variable importance: it has been observed that, overall, geographical and time variables have a great effect on the response variable and therefore on the predictive models, while others, such as the legal variables, have had almost zero effect. This could be expected, as something like a well is strongly tied to its environment and surroundings. It should also be noted, however, that the legal variables such as funder or installer were the ones with an enormous number of levels, so the previous remarks are subject to how those levels have been treated.

Then, it is important to highlight that the computational complexity of the predictive algorithms, applied to a huge dataset, has resulted in a very demanding computation process. It was difficult to process everything, with up to four computers running our code night and day for over a week. As a result, time planning has been a key factor in the project development.

Finally, it has been found that hyperparameter tuning is a very important aspect of building a prediction model, as it allows the chosen algorithms to be adjusted to each specific dataset. This is why much of the work done has been tuning-oriented, which has allowed this study to obtain better accuracy results and an improved prediction overall.


  1. https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/

  2. https://dashboard.taarifa.org

  3. http://taarifa.org/

  4. https://github.com/taarifa/TaarifaWaterpoints#taarifa-waterpoints

  5. https://www.kaggle.com/c/kkbox-churn-prediction-challenge

  6. https://www.mapsofworld.com/lat_long/tanzania-lat-long.html

  7. http://floodmap.net/Elevation/CountryElevationMap/?ct=TZ

  8. http://www.geonames.org/

  9. http://www.geonames.org/export/web-services.html#srtm3

  10. http://www.pmean.com/definitions/kappa.htm

  11. https://topepo.github.io/caret/index.html

  12. https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/